The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this: it is a historic moment, scientifically and analytically (a large amount of data is being shared across the globe and analyzed in real time), and the world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by its source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data, but it was always ‘ahead’ of Our World In Data. I also noticed that JHU’s website was buggy: clicking on the U.S. stats did not actually reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own against). An interesting aside: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate into figures like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Sun Jun 21 20:01:13 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the ‘analysis’ but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Sun Jun 21 20:01:13 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name description
prediction_model outputs case estimates for given log-linear model parameters (slope and intercept)
make_long converts input data to long format (specialized cases)
name_overlaps outputs the column-name intersection and set differences of two data frames
find_linear_index finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days-vs-log10(cases) linear model parameters and a set of days since the 100th case, and outputs a data frame with the total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b        # model is linear in log10(cases)
  total_cases<-10^total_cases.log  # back-transform to absolute case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
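prediction_model treats the fit as linear in log10(cases); a quick self-contained check of that back-transform, using illustrative (not fitted) slope and intercept values:

```r
# Illustrative log-linear parameters: log10(cases) = m*days + b
m <- 0.03  # hypothetical slope, log10 cases per day
b <- 2.0   # hypothetical intercept, log10 cases at day 0
days <- c(0, 10, 20)

log_cases <- m * days + b
cases <- 10^log_cases  # back-transform to absolute counts

round(cases)  # 100 at day 0; ~200 at day 10 (0.3 log10 is about one doubling)
```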

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as variable/values. Also enforces standardization (assumes the data structure's naming) via fixed variable.name, value.name, and id.vars
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------
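To make the wide-to-long step concrete, here is a tiny base-R sketch of what reshape2::melt does inside make_long (the toy region names and date columns are made up):

```r
# Toy wide table: one row per region, one column per date (JHU-style names)
wide <- data.frame(Country.Region = c("A", "B"),
                   X1.22.20 = c(1, 0),
                   X1.23.20 = c(2, 1))
date_cols <- setdiff(names(wide), "Country.Region")

# Manual melt: stack the date columns into (Date, value) pairs
long <- data.frame(
  Country.Region = rep(wide$Country.Region, times = length(date_cols)),
  Date = rep(date_cols, each = nrow(wide)),
  Total_confirmed_cases = unlist(wide[date_cols], use.names = FALSE)
)
nrow(long)  # 4 rows: 2 regions x 2 dates
```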

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
  i<-intersect(names(df1),names(df2))
  sd1<-setdiff(names(df1),names(df2))
  sd2<-setdiff(names(df2),names(df1))
  cat("intersection:\n",paste(i,"\n"))
  cat("in df1 but not df2:\n",paste(sd1,"\n"))
  cat("in df2 but not df1:\n",paste(sd2,"\n"))
  return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------
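name_overlaps just wraps intersect/setdiff over column names; on two toy data frames:

```r
df1 <- data.frame(a = 1, b = 2, c = 3)
df2 <- data.frame(b = 4, c = 5, d = 6)

intersect(names(df1), names(df2))  # "b" "c"
setdiff(names(df1), names(df2))    # "a" - in df1 but not df2
setdiff(names(df2), names(df1))    # "d" - in df2 but not df1
```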

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many |second derivative| values are within 0.2 of zero over a
  # sliding window of length running_avg. Choose the point at which the whole
  # window is stable.
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-(running_avg-1)),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Find the date running_avg days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}
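To see why the second derivative flags the linear (exponential-growth) regime, here is a self-contained sketch on synthetic data: a series that doubles daily, so log2 cases grow linearly.

```r
# Synthetic series that doubles daily: log2(cases) grows linearly with time,
# so its discrete second derivative sits well inside the 0.2 tolerance
cases <- 100 * 2^(0:14)
log2_cases <- log2(cases)

d1 <- diff(log2_cases)  # first derivative: ~1 doubling per day
d2 <- diff(d1)          # second derivative: ~0 for pure exponential growth

all(abs(d2) < 0.2)  # TRUE: every point passes the stability threshold
```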

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.20.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.20.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.20.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.20.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US deaths file but not in the other data... Weird. When combining, all datasets will have this column, but US deaths is the only one where it's meaningful.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU uses an awkward file format: wide, with each column a date in MM/DD/YY. Read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the World- and US-level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # R prefixes numeric column names with "X"; swap in a leading 0
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # expand two-digit year .20 to .2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
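A compact way to double-check the parsing is to strip the X prefix and parse directly; a sketch, not the pipeline code above (the column names below are examples of the JHU format):

```r
# JHU-style wide column names, as read in by read.csv
col_names <- c("X1.22.20", "X6.20.20")

# Drop the leading "X" and let %y handle the two-digit year
parsed <- as.Date(sub("^X", "", col_names), format = "%m.%d.%y")
parsed  # 2020-01-22, 2020-06-20
```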

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
death total
Afghanistan 151 151
Albania 151 151
Algeria 151 151
Andorra 151 151
Angola 151 151
Antigua and Barbuda 151 151
Argentina 151 151
Armenia 151 151
Australia 1208 1208
Austria 151 151
Azerbaijan 151 151
Bahamas 151 151
Bahrain 151 151
Bangladesh 151 151
Barbados 151 151
Belarus 151 151
Belgium 151 151
Belize 151 151
Benin 151 151
Bhutan 151 151
Bolivia 151 151
Bosnia and Herzegovina 151 151
Botswana 151 151
Brazil 151 151
Brunei 151 151
Bulgaria 151 151
Burkina Faso 151 151
Burma 151 151
Burundi 151 151
Cabo Verde 151 151
Cambodia 151 151
Cameroon 151 151
Canada 2114 2114
Central African Republic 151 151
Chad 151 151
Chile 151 151
China 4983 4983
Colombia 151 151
Comoros 151 151
Congo (Brazzaville) 151 151
Congo (Kinshasa) 151 151
Costa Rica 151 151
Cote d’Ivoire 151 151
Croatia 151 151
Cuba 151 151
Cyprus 151 151
Czechia 151 151
Denmark 453 453
Diamond Princess 151 151
Djibouti 151 151
Dominica 151 151
Dominican Republic 151 151
Ecuador 151 151
Egypt 151 151
El Salvador 151 151
Equatorial Guinea 151 151
Eritrea 151 151
Estonia 151 151
Eswatini 151 151
Ethiopia 151 151
Fiji 151 151
Finland 151 151
France 1661 1661
Gabon 151 151
Gambia 151 151
Georgia 151 151
Germany 151 151
Ghana 151 151
Greece 151 151
Grenada 151 151
Guatemala 151 151
Guinea 151 151
Guinea-Bissau 151 151
Guyana 151 151
Haiti 151 151
Holy See 151 151
Honduras 151 151
Hungary 151 151
Iceland 151 151
India 151 151
Indonesia 151 151
Iran 151 151
Iraq 151 151
Ireland 151 151
Israel 151 151
Italy 151 151
Jamaica 151 151
Japan 151 151
Jordan 151 151
Kazakhstan 151 151
Kenya 151 151
Korea, South 151 151
Kosovo 151 151
Kuwait 151 151
Kyrgyzstan 151 151
Laos 151 151
Latvia 151 151
Lebanon 151 151
Lesotho 151 151
Liberia 151 151
Libya 151 151
Liechtenstein 151 151
Lithuania 151 151
Luxembourg 151 151
Madagascar 151 151
Malawi 151 151
Malaysia 151 151
Maldives 151 151
Mali 151 151
Malta 151 151
Mauritania 151 151
Mauritius 151 151
Mexico 151 151
Moldova 151 151
Monaco 151 151
Mongolia 151 151
Montenegro 151 151
Morocco 151 151
Mozambique 151 151
MS Zaandam 151 151
Namibia 151 151
Nepal 151 151
Netherlands 755 755
New Zealand 151 151
Nicaragua 151 151
Niger 151 151
Nigeria 151 151
North Macedonia 151 151
Norway 151 151
Oman 151 151
Pakistan 151 151
Panama 151 151
Papua New Guinea 151 151
Paraguay 151 151
Peru 151 151
Philippines 151 151
Poland 151 151
Portugal 151 151
Qatar 151 151
Romania 151 151
Russia 151 151
Rwanda 151 151
Saint Kitts and Nevis 151 151
Saint Lucia 151 151
Saint Vincent and the Grenadines 151 151
San Marino 151 151
Sao Tome and Principe 151 151
Saudi Arabia 151 151
Senegal 151 151
Serbia 151 151
Seychelles 151 151
Sierra Leone 151 151
Singapore 151 151
Slovakia 151 151
Slovenia 151 151
Somalia 151 151
South Africa 151 151
South Sudan 151 151
Spain 151 151
Sri Lanka 151 151
Sudan 151 151
Suriname 151 151
Sweden 151 151
Switzerland 151 151
Syria 151 151
Taiwan* 151 151
Tajikistan 151 151
Tanzania 151 151
Thailand 151 151
Timor-Leste 151 151
Togo 151 151
Trinidad and Tobago 151 151
Tunisia 151 151
Turkey 151 151
Uganda 151 151
Ukraine 151 151
United Arab Emirates 151 151
United Kingdom 1661 1661
Uruguay 151 151
US 151 151
US_state 492411 492411
Uzbekistan 151 151
Venezuela 151 151
Vietnam 151 151
West Bank and Gaza 151 151
Western Sahara 151 151
Yemen 151 151
Zambia 151 151
Zimbabwe 151 151
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some counties are not summarized on the country level. collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

#Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th for US data
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: non-US countries may have unusual provinces. For example, France is summarized at the country level but also has 3 provinces. I've only verified case100 for the U.S., so case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider a city-level summary as well. This data may be sparse.

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
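The days-since-100 logic above (ddply + merge per Country.Region) can be sketched in base R on a toy data frame; the region and day values here are made up:

```r
# Toy longitudinal data: two regions, three days each
toy <- data.frame(region = rep(c("A", "B"), each = 3),
                  day    = c(1, 2, 3, 1, 2, 3),
                  total  = c(50, 150, 400, 90, 99, 120))

# Per-region day of the first observation above 100 cases
case100 <- aggregate(day ~ region, data = toy[toy$total > 100, ], FUN = min)
names(case100)[2] <- "case100_day"

toy <- merge(toy, case100)
toy$days_since_100 <- toy$day - toy$case100_day
# Region A crosses 100 on day 2, region B on day 3
```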

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 5961
Alaska 1216
Arizona 1451
Arkansas 6330
California 5529
Colorado 5387
Connecticut 862
Delaware 363
Diamond Princess 96
District of Columbia 97
Florida 6266
Georgia 14196
Grand Princess 97
Guam 97
Hawaii 497
Idaho 2899
Illinois 8381
Indiana 8217
Iowa 7880
Kansas 6695
Kentucky 9466
Louisiana 5905
Maine 1508
Maryland 2273
Massachusetts 1436
Michigan 7336
Minnesota 7000
Mississippi 7330
Missouri 8494
Montana 2605
Nebraska 5179
Nevada 1176
New Hampshire 1027
New Jersey 2169
New Mexico 2580
New York 5541
North Carolina 8736
North Dakota 3205
Northern Mariana Islands 82
Ohio 7765
Oklahoma 6048
Oregon 2991
Pennsylvania 6075
Puerto Rico 97
Rhode Island 574
South Carolina 4250
South Dakota 4062
Tennessee 8435
Texas 18430
Utah 1318
Vermont 1380
Virgin Islands 97
Virginia 11109
Washington 3775
West Virginia 4181
Wisconsin 6002
Wyoming 1872
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question dataset x y color facet pch dimensions
comparative longitudinal case trend (log) long time log_cases geography none (case type?) case_type [15, 50, 4] geography x (2 scale?) case type
comparative longitudinal case trend long time cases geography case_type ? [15, 50, 4] geography x (2+ scale) case type
comparative longitudinal mortality trend wide time mortality_rate geography none none [15, 50, 4] geography
death vs total correlation wide cases deaths geography none none [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are death/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are death and case load correlated? <cases> <deaths>

# lm for each?? -> fit an lm for each region starting from the 100th case; m, b associated with each.
    # input: geographical region, log cases vs day (since 100th case)
    # output: m, b for each geographical region ID



#total/death on the same plot differ by ~2 logs, so when plotting on a log scale, use pch; when plotting absolute counts, use free scales
#when plotting death and case counts on the same plot, melt first.

#Corona_Cases -> filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 67669
China 151
Diamond Princess 132
Korea, South 122
Japan 121
Italy 119
Iran 116
Singapore 113
France 112
Germany 112
Spain 111
US 109
Switzerland 108
United Kingdom 108
Belgium 107
Netherlands 107
Norway 107
Sweden 107
Austria 105
Malaysia 104
Australia 103
Bahrain 103
Denmark 103
Canada 102
Qatar 102
Iceland 101
Brazil 100
Czechia 100
Finland 100
Greece 100
Iraq 100
Israel 100
Portugal 100
Slovenia 100
Egypt 99
Estonia 99
India 99
Ireland 99
Kuwait 99
Philippines 99
Poland 99
Romania 99
Saudi Arabia 99
Indonesia 98
Lebanon 98
Pakistan 98
San Marino 98
Thailand 98
Chile 97
Luxembourg 96
Peru 96
Russia 96
Ecuador 95
Mexico 95
Slovakia 95
South Africa 95
United Arab Emirates 95
Armenia 94
Colombia 94
Croatia 94
Panama 94
Serbia 94
Taiwan* 94
Turkey 94
Argentina 93
Bulgaria 93
Latvia 93
Uruguay 93
Algeria 92
Costa Rica 92
Dominican Republic 92
Hungary 92
Andorra 91
Bosnia and Herzegovina 91
Jordan 91
Lithuania 91
Morocco 91
New Zealand 91
North Macedonia 91
Vietnam 91
Albania 90
Cyprus 90
Malta 90
Moldova 90
Brunei 89
Burkina Faso 89
Sri Lanka 89
Tunisia 89
Ukraine 88
Azerbaijan 87
Ghana 87
Kazakhstan 87
Oman 87
Senegal 87
Venezuela 87
Afghanistan 86
Cote d’Ivoire 86
Cuba 85
Mauritius 85
Uzbekistan 85
Cambodia 84
Cameroon 84
Honduras 84
Nigeria 84
West Bank and Gaza 84
Belarus 83
Georgia 83
Bolivia 82
Kosovo 82
Kyrgyzstan 82
Montenegro 82
Congo (Kinshasa) 81
Kenya 80
Niger 79
Guinea 78
Rwanda 78
Trinidad and Tobago 78
Paraguay 77
Bangladesh 76
Djibouti 74
El Salvador 73
Guatemala 72
Madagascar 71
Mali 70
Congo (Brazzaville) 67
Jamaica 67
Gabon 65
Somalia 65
Tanzania 65
Ethiopia 64
Burma 63
Sudan 62
Liberia 61
Maldives 59
Equatorial Guinea 58
Cabo Verde 56
Sierra Leone 54
Guinea-Bissau 53
Togo 53
Zambia 52
Eswatini 51
Chad 50
Tajikistan 49
Haiti 47
Sao Tome and Principe 47
Benin 45
Nepal 45
Uganda 45
Central African Republic 44
South Sudan 44
Guyana 42
Mozambique 41
Yemen 37
Mongolia 36
Mauritania 33
Nicaragua 33
Malawi 27
Syria 27
Zimbabwe 25
Bahamas 24
Libya 24
Comoros 22
Suriname 14
Angola 11
Eritrea 6
Burundi 5
# Pick the top countries with data (limited to max_colors by the palette)
max_colors<-12
# find a way to fix this: China has different provinces, so the plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 151
Korea, South 122
Japan 121
Italy 119
Iran 116
Singapore 113
France 112
Germany 112
Spain 111
US 109
Switzerland 108
United Kingdom 108
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-01-24        18285
## 2    Afghanistan           <NA> <NA> 2020-05-25        18407
## 3    Afghanistan           <NA> <NA> 2020-01-25        18286
## 4    Afghanistan           <NA> <NA> 2020-01-26        18287
## 5    Afghanistan           <NA> <NA> 2020-05-24        18406
## 6    Afghanistan           <NA> <NA> 2020-05-23        18405
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      0                     0            NaN
## 2                    219                 11173     0.01960082
## 3                      0                     0            NaN
## 4                      0                     0            NaN
## 5                    218                 10582     0.02060102
## 6                    216                  9998     0.02160432
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                      -Inf                       -Inf        18348
## 2                  4.048170                   2.340444        18348
## 3                      -Inf                       -Inf        18348
## 4                      -Inf                       -Inf        18348
## 5                  4.024568                   2.338456        18348
## 6                  3.999913                   2.334454        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1            -63  NA   NA         NA                           NA
## 2             59  NA   NA         NA                           NA
## 3            -62  NA   NA         NA                           NA
## 4            -61  NA   NA         NA                           NA
## 5             58  NA   NA         NA                           NA
## 6             57  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)
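The commented-out model fit above can be sanity-checked on synthetic data; a minimal sketch (the growth rate of 0.05 log10 cases/day is an arbitrary choice, not an estimate):

```r
# Synthetic log-linear growth: cases = 100 * 10^(0.05 * day)
days <- 0:30
log_cases <- log10(100 * 10^(0.05 * days))

fit <- lm(log_cases ~ days)
slope <- unname(coef(fit)[2])      # recovers ~0.05
intercept <- unname(coef(fit)[1])  # recovers ~2, i.e. 100 cases at day 0
```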

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:

1. Find the time at which cases vs. date becomes linear in each plot
2. Fit a linear model for each state
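The `find_linear_index` helper used below is defined upstream and not shown here; a minimal sketch of what such a helper might look like, assuming it scans the ordered case curve for the first date where the discrete second derivative is near zero (the hypothetical `tol` threshold and function name are illustrative only):

```r
# Hypothetical sketch of a find_linear_index()-style helper: return the
# numeric date at which a state's case curve becomes roughly linear,
# i.e. where the second difference of cases stays near zero.
find_linear_index_sketch <- function(df, tol = 0.05) {
  df <- df[order(df$Date), ]
  y <- df$Total_confirmed_cases_perstate
  d2 <- diff(y, differences = 2)            # discrete 2nd derivative
  # flag points where curvature is small relative to the local case load
  flat <- abs(d2) <= tol * pmax(y[-c(1, 2)], 1)
  idx <- which(flat)[1]                     # first "flat" point
  if (is.na(idx)) idx <- 1
  as.numeric(df$Date[idx + 2])              # align index with diff() offset
}
```

As with the real helper, the return value is `as.numeric(Date)`, which is how it is compared against `as.numeric(Date)` in the model-fitting loop below.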

# What is the predict # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually adjust the state list: swap New York out in favor of Maryland
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs. time becomes linear (2nd derivative approximately 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

# Initialize an empty frame so rbind() doesn't pick up a spurious all-NA row
predicted_days_df<-data.frame(Province.State=character(),days=numeric(),Total_confirmed_cases_perstate=numeric())

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What is the prediction of COVID-19 cases based on the model thus far? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long will it be until x number of cases? When will the plateau happen? Are any effects of social distancing noticeable, accounting for reporting delays?
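The `prediction_model` helper used throughout this section is defined upstream and not shown here. A hedged sketch of what it plausibly does for the Q2 fit, assuming slope `m` and intercept `b` come from a regression of `Total_confirmed_cases.log` (log10 cases) on days, so extrapolation is linear on the log scale (the `_sketch` name and returned column set are assumptions, not the actual definition):

```r
# Sketch of a prediction_model()-style helper: given slope m and intercept b
# from a fit of log10(cases) ~ days, extrapolate case counts for new days.
prediction_model_sketch <- function(m, b, days) {
  cases.log <- m * days + b            # linear on the log10 scale
  data.frame(Days_since_100 = days,
             Total_confirmed_cases = 10^cases.log,
             Total_confirmed_cases.log = cases.log)
}
```

Under this reading, each extra `1/m` days multiplies total cases by 10, which is why the fitted slope is the quantity of interest.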

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
# Toy data to prototype per-group model fitting
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
lm(data =tmp,formula = y~x )

# Fit one lm per group with dlply, then collect the per-group coefficients
train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # plyr::rename expects c(old_name = new_name); the slope column is named
  # after the predictor in the formula
  predictor<-all.vars(formula_input)[2]
  case_models.parameters<-plyr::rename(case_models.parameters,setNames(c("b","m"),c("(Intercept)",predictor)))
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

# Note: relies on historical_model_predictions.withHx from the commented-out block above
(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which computes the % change in user mobility relative to baseline for:

* Retail & Recreation
* Grocery & Pharmacy
* Parks
* Transit
* Workplace
* Residential

Data from https://www.google.com/covid19/mobility/
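The scraped mobility values arrive as percent strings (e.g. "-39%") and have to be coerced to numeric before plotting, which the `gsub()` call in the chunk below does inline. The same idea as a small helper (the `parse_pct` name is illustrative, not part of the pipeline):

```r
# Strip the "%" sign from a scraped mobility value and coerce to numeric.
# fixed = TRUE treats "%" as a literal string rather than a regex.
parse_pct <- function(x) as.numeric(gsub("%", "", x, fixed = TRUE))
```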

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upsteam, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases per 100 people (today)")+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab("Confirmed cases (today) per 100 people")+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit 1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Grocery_Pharmacy 0.9728199 -34.0
New Hampshire Parks 0.9495516 -20.0
Hawaii Parks 0.9427290 -72.0
Hawaii Retail_Recreation 0.9343342 -56.0
Vermont Parks 0.9336486 -35.5
Maine Transit -0.9183518 -50.0
Hawaii Transit 0.9121202 -89.0
Connecticut Grocery_Pharmacy -0.8838185 -6.0
Utah Residential -0.8669043 12.0
Utah Transit -0.8575675 -18.0
South Dakota Parks 0.7991124 -26.0
Arizona Grocery_Pharmacy -0.7758701 -15.0
Rhode Island Workplace -0.7677330 -39.5
Wyoming Parks -0.7563518 -4.0
Connecticut Transit -0.7516698 -50.0
Massachusetts Workplace -0.7510864 -39.0
Alaska Grocery_Pharmacy -0.7370185 -7.0
Arizona Residential 0.7359479 13.0
Alaska Residential 0.7356425 13.0
Nevada Transit -0.6863188 -20.0
Alaska Workplace -0.6659190 -33.0
Vermont Grocery_Pharmacy -0.6543390 -25.0
New York Workplace -0.6537114 -34.5
North Dakota Parks 0.6395488 -34.0
Idaho Residential -0.6321297 11.0
Rhode Island Retail_Recreation -0.6311475 -45.0
Utah Parks -0.6058483 17.0
New Jersey Parks -0.6015287 -6.0
Rhode Island Residential -0.5983355 18.5
New York Retail_Recreation -0.5903114 -46.0
Maine Workplace -0.5902779 -30.0
Nebraska Workplace 0.5815087 -32.0
Utah Workplace -0.5391453 -37.0
Arkansas Parks 0.5357150 -12.0
New York Parks 0.5296876 20.0
New Jersey Workplace -0.5275257 -44.0
Arizona Retail_Recreation -0.5254095 -42.5
Connecticut Retail_Recreation -0.5153478 -45.0
Connecticut Residential 0.5079490 14.0
Maine Parks 0.5005321 -31.0
Massachusetts Retail_Recreation -0.4948592 -44.0
New Jersey Grocery_Pharmacy -0.4868377 2.5
New Mexico Grocery_Pharmacy -0.4845342 -11.0
Connecticut Workplace -0.4705173 -39.0
Nebraska Residential -0.4658829 14.0
Arizona Transit 0.4657852 -38.0
New Mexico Residential 0.4585572 13.5
Maryland Workplace -0.4563523 -35.0
Rhode Island Parks 0.4481230 52.0
Montana Parks -0.4443266 -58.0
Iowa Parks -0.4429655 28.5
North Dakota Retail_Recreation -0.4305768 -42.0
Illinois Transit -0.4247279 -31.0
Kansas Parks 0.4245358 72.0
West Virginia Parks 0.4205932 -33.0
Pennsylvania Workplace -0.4187592 -36.0
New Jersey Retail_Recreation -0.4162814 -62.5
Montana Residential 0.4122963 14.0
Maryland Grocery_Pharmacy -0.4118167 -10.0
New Jersey Transit -0.4101117 -50.5
Hawaii Residential -0.4099228 19.0
Massachusetts Grocery_Pharmacy -0.4067505 -7.0
New Mexico Parks 0.4002503 -31.5
Kentucky Parks -0.3996996 28.5
Vermont Residential 0.3985478 11.5
Pennsylvania Retail_Recreation -0.3916974 -45.0
New Hampshire Residential -0.3908028 14.0
South Carolina Workplace 0.3848222 -30.0
Oregon Parks -0.3844636 16.5
Alabama Workplace -0.3823198 -29.0
Missouri Residential -0.3788781 13.0
Michigan Parks 0.3735526 28.5
New Mexico Retail_Recreation -0.3701970 -42.5
New York Transit -0.3663517 -48.0
Alabama Grocery_Pharmacy -0.3635411 -2.0
South Carolina Parks -0.3464988 -23.0
Nebraska Grocery_Pharmacy 0.3412825 -0.5
North Dakota Workplace 0.3364090 -40.0
Wisconsin Transit -0.3324070 -23.5
Arkansas Retail_Recreation -0.3258501 -30.0
Maryland Retail_Recreation -0.3235784 -39.0
Montana Transit 0.3175023 -41.0
Virginia Transit -0.3164594 -32.5
Idaho Workplace -0.3132267 -29.0
Maine Retail_Recreation -0.3131741 -42.0
Alaska Retail_Recreation 0.3103168 -39.0
California Residential 0.3029900 14.0
Florida Residential 0.3012987 14.0
Colorado Residential 0.3008882 14.0
Idaho Grocery_Pharmacy -0.2991008 -5.5
Nevada Residential 0.2974782 17.0
Illinois Workplace -0.2952780 -30.5
California Parks -0.2929723 -38.5
California Transit -0.2923557 -42.0
Idaho Transit -0.2841702 -30.0
Minnesota Transit -0.2837067 -28.5
Pennsylvania Parks 0.2799316 12.0
Vermont Retail_Recreation 0.2727294 -57.0
Wyoming Grocery_Pharmacy -0.2717083 -10.0
Pennsylvania Grocery_Pharmacy -0.2657156 -6.0
North Carolina Workplace 0.2651901 -31.0
Oregon Grocery_Pharmacy -0.2639225 -7.0
Vermont Workplace -0.2638060 -43.0
Maryland Residential 0.2629831 15.0
Kansas Workplace 0.2610229 -32.5
North Carolina Grocery_Pharmacy 0.2598531 0.0
Georgia Grocery_Pharmacy -0.2586512 -10.0
Rhode Island Grocery_Pharmacy 0.2579082 -7.5
West Virginia Grocery_Pharmacy -0.2568512 -6.0
Alabama Transit -0.2564904 -36.5
Illinois Parks 0.2520903 26.5
Mississippi Residential 0.2511330 13.0
Tennessee Workplace -0.2496358 -31.0
Rhode Island Transit -0.2479675 -56.0
Missouri Workplace 0.2471773 -29.0
Texas Workplace 0.2426049 -32.0
Nevada Workplace -0.2402580 -40.0
Tennessee Residential 0.2396904 11.5
Wisconsin Parks 0.2362244 51.5
Florida Parks -0.2349950 -43.0
Wyoming Workplace -0.2343198 -31.0
Washington Grocery_Pharmacy 0.2332811 -7.0
Minnesota Parks 0.2330226 -9.0
Georgia Workplace -0.2286407 -33.5
Nevada Retail_Recreation -0.2286197 -43.0
New York Grocery_Pharmacy -0.2281058 8.0
Tennessee Parks -0.2260263 10.5
South Dakota Workplace 0.2205607 -35.0
Georgia Retail_Recreation -0.2179653 -41.0
Arizona Parks -0.2167995 -44.5
Indiana Parks -0.2163231 29.0
Texas Residential -0.2065612 15.0
Nebraska Transit -0.2046266 -9.0
North Dakota Grocery_Pharmacy -0.2032393 -8.0
Missouri Parks 0.2021412 0.0
Oregon Residential -0.2001144 10.5
Kansas Grocery_Pharmacy -0.1998225 -14.0
North Carolina Transit 0.1994367 -32.0
Connecticut Parks 0.1992434 43.0
West Virginia Workplace 0.1942153 -33.0
Illinois Residential 0.1941187 14.0
Mississippi Grocery_Pharmacy -0.1934143 -8.0
Alabama Parks 0.1927882 -1.0
Colorado Parks -0.1881024 2.0
South Dakota Residential 0.1823817 15.0
Wisconsin Residential -0.1800080 14.0
Virginia Residential 0.1788756 14.0
North Carolina Residential 0.1771377 13.0
Wyoming Transit -0.1709584 -17.0
Pennsylvania Transit -0.1705068 -42.0
New Hampshire Retail_Recreation -0.1698643 -41.0
Ohio Transit 0.1664820 -28.0
Kentucky Transit -0.1663895 -31.0
Kentucky Grocery_Pharmacy 0.1656510 4.0
Massachusetts Parks 0.1627565 39.0
Indiana Residential 0.1625852 12.0
Utah Retail_Recreation -0.1625823 -40.0
Oklahoma Parks 0.1622177 -18.5
Texas Parks 0.1615510 -42.0
Idaho Retail_Recreation -0.1567754 -39.5
New Mexico Transit 0.1524773 -38.5
Idaho Parks 0.1522135 -22.0
New Jersey Residential 0.1510671 18.0
Kentucky Residential 0.1430525 12.0
Montana Workplace -0.1414141 -40.0
North Dakota Residential -0.1406521 17.0
Minnesota Retail_Recreation 0.1376259 -40.5
Virginia Grocery_Pharmacy -0.1341308 -8.0
Texas Grocery_Pharmacy 0.1315634 -14.0
West Virginia Residential -0.1315428 11.0
Montana Retail_Recreation 0.1303051 -51.0
Iowa Transit 0.1296912 -24.0
North Dakota Transit 0.1285819 -48.0
North Carolina Retail_Recreation 0.1259155 -34.0
Indiana Retail_Recreation 0.1252195 -38.0
Mississippi Retail_Recreation -0.1250649 -40.0
Michigan Workplace -0.1243474 -40.0
Wisconsin Grocery_Pharmacy 0.1230286 -1.0
Utah Grocery_Pharmacy 0.1203156 -4.0
California Grocery_Pharmacy -0.1162171 -11.5
Missouri Transit -0.1152449 -24.5
Oregon Transit 0.1109754 -27.5
Wisconsin Workplace -0.1090395 -31.0
Virginia Parks 0.1063732 6.0
Nebraska Retail_Recreation 0.1062960 -36.0
Oklahoma Grocery_Pharmacy -0.1060557 -0.5
Arkansas Workplace -0.1058681 -26.0
Florida Transit -0.1055804 -49.0
Alabama Retail_Recreation 0.1055784 -39.0
Mississippi Parks -0.1055566 -25.0
Iowa Workplace 0.1054336 -30.0
Maryland Transit -0.1050626 -39.0
Florida Retail_Recreation 0.1047209 -43.0
New Hampshire Grocery_Pharmacy -0.1024118 -6.0
Massachusetts Transit -0.1011312 -45.0
Kansas Transit -0.1008959 -26.5
Massachusetts Residential 0.0995446 15.0
Oregon Retail_Recreation 0.0992354 -40.5
California Retail_Recreation -0.0986908 -44.0
Wyoming Residential 0.0974292 12.5
Mississippi Transit -0.0963377 -38.5
Minnesota Grocery_Pharmacy 0.0962171 -6.0
Michigan Transit 0.0960288 -46.0
Oklahoma Workplace 0.0945010 -31.0
Indiana Workplace 0.0940380 -34.0
Iowa Grocery_Pharmacy -0.0928793 4.0
New York Residential 0.0898338 17.5
West Virginia Transit -0.0894801 -45.0
Hawaii Workplace 0.0887792 -46.0
Michigan Residential 0.0861561 15.0
Ohio Residential 0.0855623 14.0
Michigan Retail_Recreation -0.0838930 -53.0
Minnesota Residential -0.0802688 17.0
Alabama Residential -0.0802246 11.0
Virginia Retail_Recreation -0.0797060 -35.0
Oregon Workplace -0.0765583 -31.0
Pennsylvania Residential 0.0761331 15.0
Nevada Parks 0.0753255 -12.5
Texas Retail_Recreation 0.0736083 -40.0
Kentucky Retail_Recreation 0.0731082 -29.0
Texas Transit 0.0713251 -41.5
Ohio Grocery_Pharmacy 0.0700486 0.0
South Dakota Transit -0.0679467 -40.0
Georgia Parks 0.0671370 -6.0
Georgia Residential -0.0659419 13.0
South Carolina Transit -0.0649883 -45.0
Washington Transit -0.0642711 -33.5
Virginia Workplace -0.0631159 -32.0
North Carolina Parks -0.0615784 7.0
Vermont Transit -0.0569120 -63.0
Oklahoma Residential -0.0552293 15.0
Washington Retail_Recreation 0.0552147 -42.0
Maine Residential -0.0545410 11.0
Georgia Transit -0.0511523 -35.0
Ohio Parks -0.0510636 67.5
South Dakota Retail_Recreation -0.0500546 -39.0
West Virginia Retail_Recreation -0.0496500 -38.5
Iowa Retail_Recreation 0.0490979 -38.0
Tennessee Transit 0.0474679 -32.0
Ohio Retail_Recreation 0.0468041 -36.0
Indiana Transit 0.0467623 -29.0
Arkansas Residential 0.0460404 12.0
New Hampshire Workplace 0.0453023 -37.0
Minnesota Workplace -0.0439054 -33.0
Missouri Retail_Recreation -0.0438983 -36.0
Arkansas Grocery_Pharmacy -0.0436977 3.0
Kentucky Workplace -0.0434920 -36.5
Colorado Transit 0.0434030 -36.0
South Carolina Residential -0.0432593 12.0
Montana Grocery_Pharmacy -0.0395757 -16.0
Florida Grocery_Pharmacy 0.0390905 -14.0
New Hampshire Transit 0.0383483 -57.0
Ohio Workplace -0.0382727 -35.0
Nebraska Parks 0.0374187 55.5
New Mexico Workplace 0.0354183 -34.0
California Workplace -0.0348282 -36.0
Wyoming Retail_Recreation 0.0343102 -39.0
Illinois Retail_Recreation 0.0341780 -40.0
Arizona Workplace -0.0336982 -35.0
Illinois Grocery_Pharmacy -0.0314724 2.0
Maine Grocery_Pharmacy -0.0311804 -13.0
Colorado Retail_Recreation -0.0309671 -44.0
Colorado Grocery_Pharmacy -0.0293829 -17.0
Tennessee Retail_Recreation -0.0275179 -30.0
Washington Parks 0.0197340 -3.5
Wisconsin Retail_Recreation 0.0190376 -44.0
Arkansas Transit -0.0176843 -27.0
Kansas Residential -0.0157613 13.0
Florida Workplace 0.0156913 -33.0
Maryland Parks 0.0156233 27.0
Mississippi Workplace -0.0152403 -33.0
Oklahoma Transit 0.0143374 -26.0
South Carolina Retail_Recreation 0.0125976 -35.0
Colorado Workplace 0.0120842 -39.0
South Dakota Grocery_Pharmacy -0.0117423 -9.0
Indiana Grocery_Pharmacy -0.0113221 -5.5
Tennessee Grocery_Pharmacy 0.0109042 6.0
Washington Workplace -0.0101875 -38.0
Oklahoma Retail_Recreation -0.0099590 -31.0
Missouri Grocery_Pharmacy -0.0079501 2.0
South Carolina Grocery_Pharmacy 0.0070635 1.0
Nevada Grocery_Pharmacy -0.0053851 -12.5
Michigan Grocery_Pharmacy -0.0035823 -11.0
Kansas Retail_Recreation -0.0019910 -38.0
Iowa Residential -0.0001982 13.0
Washington Residential -0.0000910 13.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to github. These are provided as a convenience, but note this is a manual process. The generation of reports, plots and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
name | handle | link
---- | ------ | ----
World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
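As an aside, the manual paste()/apply() formatting above could be replaced with knitr's built-in table helper; a minimal sketch, assuming the knitr package is installed (the two-row manifest here is illustrative, in the same shape as deliverable_manifest):

```r
library(knitr)

# Illustrative two-row manifest shaped like deliverable_manifest above
manifest <- data.frame(
  name = c("World total & death cases, longitudinal", "World mortality"),
  plot_handle = c("Corona_Cases.world.long.plot",
                  "Corona_Cases.world.mortality.plot"),
  stringsAsFactors = FALSE
)

# kable() emits a markdown table that renders cleanly in the knitted report
kable(manifest, format = "markdown")
```

This avoids hand-assembling separator characters and keeps the table in sync with the data frame's columns.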

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in a log-linear phase for the world or the U.S. (though some regions, such as MD, are still in a log-linear phase). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with a range that depends on region. Mobility is not a strong indicator of case load (U.S. data).
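The "log-linear phase" claim can be checked with a quick regression of log cases on date: during exponential growth, log10(cases) is linear in time, so the fit's R² stays near 1. A minimal sketch, with an illustrative data frame rather than the actual case tables used above:

```r
library(dplyr)

# R^2 of a log10(cases) ~ date fit over the most recent window.
# Values near 1 suggest the region is still in a log-linear phase.
log_linear_r2 <- function(df, window_days = 14) {
  recent <- df %>%
    filter(Date >= max(Date) - window_days, Total_confirmed_cases > 0)
  fit <- lm(log10(Total_confirmed_cases) ~ as.numeric(Date), data = recent)
  summary(fit)$r.squared
}

# Synthetic exponential growth: cases double roughly every 3 days
toy <- data.frame(
  Date = as.Date("2020-04-01") + 0:20,
  Total_confirmed_cases = round(100 * 2^((0:20) / 3))
)
log_linear_r2(toy)  # near 1: still log-linear
```

A region that has left the log-linear phase (growth slowing toward a plateau) would show a visibly curved log plot and a lower R² over recent windows.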

See table below for detailed breakdown.

Question | Answer
-------- | ------
What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, at either the country (U.S.) or state level. California appears to have one of the better correlations, but the results are mixed.
What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are overall log-linear for most countries, with a trailing off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in a log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region but is about 3% overall.
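The mobility conclusion rests on correlating mobility change against case counts per region. A minimal sketch of that comparison; the data frame and column names here are hypothetical stand-ins for the Google mobility and case tables joined earlier, not the actual variables used above:

```r
# Per-state Pearson correlation between retail mobility change (%) and
# daily new cases. Values are fabricated for illustration only.
mobility_vs_cases <- data.frame(
  state = rep(c("California", "Maryland"), each = 5),
  retail_mobility_change = c(-40, -35, -30, -28, -25,
                             -45, -44, -40, -38, -36),
  new_cases = c(1200, 1150, 1300, 1400, 1500,
                300, 320, 310, 350, 340)
)

sapply(split(mobility_vs_cases, mobility_vs_cases$state),
       function(d) cor(d$retail_mobility_change, d$new_cases))
```

A consistently strong positive or negative correlation across states would support a mobility effect; the mixed-sign, modest values observed in the actual data underlie the "not a strong indicator" conclusion.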

END

End: ##—— Sun Jun 21 20:02:41 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
# install.packages("maps")  # uncomment on first run
library(maps)
library(dplyr)
library(ggplot2)

# Note: named mi_counties, but this actually holds Pennsylvania counties
mi_counties <- map_data("county", "pennsylvania") %>% 
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- 1:2226  # placeholder values, one per map row
name_overlaps(metadata,Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as the label coordinate for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date=="2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

# fix: "join" is not a package; dplyr::inner_join() dispatches to sf's method
dplyr::inner_join(Corona_Cases.US_state, counties)

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Lat","Long")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_cast(st_as_sf(spdf))  # st_cast() expects an sf object, so convert the sp object first

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net